QC and filter seurat object¶

In [1]:
library_load <- suppressMessages(
    
    list(
        
        # Seurat 
        library(Seurat), 
        
        # Scran doubletFinder implementation 
        library(scDblFinder), 
        
        # FastMNN implementation 
        library(SeuratWrappers), 
        
        # Data 
        library(tidyverse), 
        
        # Reticulate
        library(reticulate), 
        
        # Biomart 
        library(biomaRt), 
        
        # Plotting 
        library(ggplot2), 
        library(knitr),
        library(kableExtra),
        library(IRdisplay)
        
    )
)
In [2]:
random_seed <- 42
set.seed(random_seed)
In [3]:
options(warn=-1)
In [4]:
# Set working directory to project root
setwd("/research/peer/fdeckert/FD20200109SPLENO")
In [5]:
# Source files
source("plotting_global.R")
source("bin/SeuratQC.R")

# SingleRPlot
source("bin/SingleRQC.R")

Parameter settings¶

In [6]:
# Files 
so_raw_file <- "data/object/raw.rds"

so_qc_rds <- "data/object/qc.rds"
so_qc_h5ad <- "data/object/qc.h5ad"

# Plotting Theme
ggplot2::theme_set(theme_global_set()) # From project global source()

Import Seurat object¶

In [7]:
so_raw <- readRDS(so_raw_file)
In [8]:
so_raw$tissue <- factor(so_raw$tissue, levels=c("Myeloid", "Progenitor"))
so_raw$treatment <- factor(so_raw$treatment, levels=c("NaCl", "CpG"))

Rank plot¶

Empty droplets were determined with CellRanger V3.0.2 Lun et al., 2019 EmptyDrop heuristic. RNAse activity of granulocytes might be wrongly identified as empty cells by CellRanger.

Typical Sample A steep drop-off is indicative of good separation between the cell-associated barcodes and the barcodes associated with empty GEMs. A ideal barcode rank plot has a distincitve shape, which is referred to as a "cliff and knee".
Heterogeneous Sample Heterogeneous populations of cells in a sample result in two "cliff and knee" distributions. However, there should still be clear separation between the bacodes.
Compromised Sample Round curve and lack of steep cliff may indicate low sample quality or loss of single-cell behavior. This can be due to a wetting failure, premature cell lysis, or low cell viability.
Compromised Sample Defined cliff and knee, but the total number of barcodes detected may be lower than expected. This can be caused by a sample clog or inaccurate cell count.

In [9]:
options(repr.plot.width=10, repr.plot.height=5)
rank_plot_qc_1 <- rank_plot_qc(so_raw, color_color=color$cellranger_class, formular=tissue~sample_rep+treatment)
rank_plot_qc_1
ggsave(rank_plot_qc_1, filename="result/qc/plot/rank_plot_qc.png", width=10, height=5)

Filter by CellRanger class¶

In [10]:
so_qc <- subset(so_raw, subset=cellranger_class == "Cell")

Compute cell cycle score¶

In [11]:
# Get Seurat cell cycle genes 
cc_genes_seurat_s <- cc.genes.updated.2019$s.genes
cc_genes_seurat_g2m <- cc.genes.updated.2019$g2m.genes

# Get mouse orthologs from human gene simbols
httr::set_config(httr::config(ssl_verifypeer=FALSE))

hgnc_mart <- useMart("ensembl", dataset="hsapiens_gene_ensembl", host="https://dec2021.archive.ensembl.org/")
mm_mart <- useMart("ensembl", dataset="mmusculus_gene_ensembl", host="https://dec2021.archive.ensembl.org/")

cc_genes_seurat_s <- getLDS(attributes=c("hgnc_symbol"), filters="hgnc_symbol", values=cc_genes_seurat_s, mart=hgnc_mart, attributesL=c("mgi_symbol"), martL=mm_mart, uniqueRows=TRUE)
cc_genes_seurat_s <- cc_genes_seurat_s[, 2]
    
cc_genes_seurat_g2m <- getLDS(attributes=c("hgnc_symbol"), filters="hgnc_symbol", values=cc_genes_seurat_g2m, mart=hgnc_mart, attributesL=c("mgi_symbol"), martL=mm_mart, uniqueRows=TRUE)
cc_genes_seurat_g2m <- cc_genes_seurat_g2m[, 2]

# Compute cell cycle score 
so_qc <- CellCycleScoring(so_qc, s.features=cc_genes_seurat_s, g2m.features=cc_genes_seurat_g2m, set.ident=FALSE)

colnames(so_qc@meta.data) <- gsub("Phase", "cc_phase_class", colnames(so_qc@meta.data))
colnames(so_qc@meta.data) <- gsub("S.Score", "msS_RNA", colnames(so_qc@meta.data))
colnames(so_qc@meta.data) <- gsub("G2M.Score", "msG2M_RNA", colnames(so_qc@meta.data)) 
    
so_qc$msCC_diff_RNA <- so_qc$msS_RNA - so_qc$msG2M_RNA
so_qc$cc_phase_class <- factor(so_qc$cc_phase_class, levels=names(color$cc_phase_class))

SingleR annotation¶

SingleR identifies marker genes from the reference and uses them to compute assignment score (based on the Spearman correlation across markers) for each cell in the test dataset against each label in the reference. The label with the highest doublet_stat is the assigned to the test cell, possibly with further fine-tuning to resolve closely related labels.

first.labels: Labels before fine-tuning
labels: Labels after fine-tuning
pruning: labels after pruning

ImmGen¶

In [12]:
if(FALSE) {
    
    # Load reference data 
    ref <- ImmGenData()
    
    # Seurat object to SingleCellExperiment
    sce <- SingleCellExperiment(list(counts=so_qc@assays$RNA@counts))

    # # Predict labels
    label_main <- SingleR::SingleR(test=sce, ref=ref, labels=ref$label.main, assay.type.test="counts", de.method="classic")
    label_fine <- SingleR::SingleR(test=sce, ref=ref, labels=ref$label.fine, assay.type.test="counts", de.method="classic")

    saveRDS(label_main, "result/SingleR/label_main_immgen.rds")
    saveRDS(label_fine, "result/SingleR/label_fine_immgen.rds")

} else {
    
    label_main <- readRDS("result/SingleR/label_main_immgen.rds")
    label_fine <- readRDS("result/SingleR/label_fine_immgen.rds")
    
}

# Add labels to Seurat object
label_main_meta <- as.data.frame(label_main) %>% dplyr::select(pruned.labels, tuning.scores.first) %>% dplyr::rename(label_main_immgen=pruned.labels, delta_score_main_immgen=tuning.scores.first)
label_fine_meta <- as.data.frame(label_fine) %>% dplyr::select(pruned.labels, tuning.scores.first) %>% dplyr::rename(label_fine_immgen=pruned.labels, delta_score_fine_immgen=tuning.scores.first)

so_qc <- AddMetaData(so_qc, cbind(label_main_meta, label_fine_meta))

Haemosphere¶

In [13]:
if(FALSE) {
    
    # Load reference data 
    ref <- readRDS("data/haemosphere/se_haemosphere.rds")
    
    # Seurat object to SingleCellExperiment
    sce <- SingleCellExperiment(list(counts=so_qc@assays$RNA@counts))

    # # Predict labels
    label_main <- SingleR::SingleR(test=sce, ref=ref, labels=ref$label.main, assay.type.test="counts", de.method="classic")
    label_fine <- SingleR::SingleR(test=sce, ref=ref, labels=ref$label.fine, assay.type.test="counts", de.method="classic")

    saveRDS(label_main, "result/SingleR/label_main_haemosphere.rds")
    saveRDS(label_fine, "result/SingleR/label_fine_haemosphere.rds")

} else {
    
    label_main <- readRDS("result/SingleR/label_main_haemosphere.rds")
    label_fine <- readRDS("result/SingleR/label_fine_haemosphere.rds")
    
}

# Add labels to Seurat object
label_main_meta <- as.data.frame(label_main) %>% dplyr::select(pruned.labels, tuning.scores.first) %>% dplyr::rename(label_main_haemosphere=pruned.labels, delta_score_main_haemosphere=tuning.scores.first)
label_fine_meta <- as.data.frame(label_fine) %>% dplyr::select(pruned.labels, tuning.scores.first) %>% dplyr::rename(label_fine_haemosphere=pruned.labels, delta_score_fine_haemosphere=tuning.scores.first)

so_qc <- AddMetaData(so_qc, cbind(label_main_meta, label_fine_meta))
In [14]:
options(repr.plot.width=10, repr.plot.height=10)
ggplot(so_qc@meta.data, aes(y=label_fine_haemosphere, fill=label_fine_haemosphere)) + 
    geom_bar() + 
    xlab("log10(cell count)") + ylab("") + ggtitle("Fine Haemosphere label") +
    scale_x_log10(breaks=c(0, 10, 100, 1000)) +
    facet_grid(tissue~treatment+sample_rep) + 
    scale_fill_manual(values=color$label_fine_haemosphere)

Set Seurat object QC class¶

We recognize a substantial number or cells annotated as erythrocytes in the Myeloid sort. Since we do not expect those cells based on the FACS panel they are probably doublet cells or phagocytes which took up a erythrocyte. As shown in the scatter plots below those cells show either a very high or low feature count respectively. While it might be of interest keeping the phagocytes the low number of features detected – probably to the exaggerated Hb mRNA count – makes downstream analysis difficult. We therefore exclude cells from the Myeloid sort which were annotated as Ery.

In [15]:
qc_class_set <- function(so) {
    
    # QC filter for all cells
    so$pMt_RNA_min <- -Inf
    so$pMt_RNA_max <- 5
    
    so$nCount_RNA_min <- 1500
    so$nCount_RNA_max <- max(so$nCount_RNA)

    so$qc_class <- ifelse(
        so$cellranger_class == "Cell" & 
        so$nCount_RNA <= so$nCount_RNA_max & 
        so$nCount_RNA > so$nCount_RNA_min &
        so$pMt_RNA <= so$pMt_RNA_max &
        so$pMt_RNA > so$pMt_RNA_min,
        "pass", "fail"
      )
    
    # Remove Ery from Myeloid sort before computing nFeature_RNA
    if(so$tissue[1]=="Myeloid") {
        
        so$qc_class <- ifelse(
            so$label_main_haemosphere=="Ery" | so$label_fine_haemosphere %in% c("preCFUE", "CFUE", "pbEry", "poEry", "Retic") | so$pHb_RNA >= 5, 
            "fail", so$qc_class
        )
        
    }
    
    so_tmp <- subset(so, subset=qc_class=="pass")
    
    so$nFeature_RNA_max <- quantile(so_tmp$nFeature_RNA, 0.99)
    so$nFeature_RNA_min <- quantile(so_tmp$nFeature_RNA, 0.01)
    
    so$qc_class <- ifelse(
        so$qc_class == "pass" &
        so$nFeature_RNA <= so$nFeature_RNA_max & 
        so$nFeature_RNA > so$nFeature_RNA_min, 
        "pass", "fail"
      )
    
    return(so)

}
In [16]:
so_qc <- Seurat::SplitObject(so_qc, split.by="sample_name")
so_qc <- lapply(so_qc, qc_class_set)
so_qc <- merge(so_qc[[1]], so_qc[2:length(so_qc)])
In [17]:
so_qc$tissue <- factor(so_qc$tissue, levels=c("Myeloid", "Progenitor"))
so_qc$treatment <- factor(so_qc$treatment, levels=c("NaCl", "CpG"))

QC plots¶

Density plot¶

In [18]:
density_plot_qc_1 <- density_plot_qc(so=so_qc, title="Density plot UMI count", x=nCount_RNA, xlab="log10(UMI count)", min=nCount_RNA_min, max=nCount_RNA_max, fill_color=color$tissue)
density_plot_qc_2 <- density_plot_qc(so=so_qc, title="Density plot Feature count", x=nFeature_RNA, xlab="log10(Feature count)", min=nFeature_RNA_min, max=nFeature_RNA_max, fill_color=color$tissue)
density_plot_qc_3 <- density_plot_qc(so=so_qc, title="Density plot Mt %", x=pMt_RNA, xlab="Mt [%]", min=0, max=pMt_RNA_max, log10=FALSE, fill_color=color$tissue)
In [19]:
ggsave(density_plot_qc_1, filename="result/qc/plot/density_plot_qc_umi.png", width=10, height=5)
ggsave(density_plot_qc_2, filename="result/qc/plot/density_plot_qc_feature.png", width=10, height=5)
ggsave(density_plot_qc_3, filename="result/qc/plot/density_plot_qc_mt.png", width=10, height=5)
In [20]:
options(repr.plot.width=30, repr.plot.height=5)
density_plot_qc_1 + density_plot_qc_2 + density_plot_qc_3 + plot_layout(nrow=1) & theme(legend.position="bottom")

Scattern plot¶

In [21]:
scattern_plot_qc_1 <- scattern_plot_qc(so=so_qc, title="Mitochondrial gene percentage", color=pMt_RNA)
scattern_plot_qc_2 <- scattern_plot_qc(so=so_qc, title="Hemoglobin gene percentage", color=pHb_RNA)
scattern_plot_qc_3 <- scattern_plot_qc(so=so_qc, title="Ribsonmal gene percentage", color=pRb_RNA)
scattern_plot_qc_4 <- scattern_plot_qc(so=so_qc, title="Fine labels", color=label_fine_haemosphere) + scale_color_manual(values=color$label_fine_haemosphere)
scattern_plot_qc_5 <- scattern_plot_qc(so=so_qc, title="QC class", color=qc_class) + scale_color_manual(values=c("fail"="red", "pass"="black"))
In [22]:
ggsave(scattern_plot_qc_1, filename="result/qc/plot/sc_mt.png", width=10, height=5)
ggsave(scattern_plot_qc_2, filename="result/qc/plot/sc_hg.png", width=10, height=5)
ggsave(scattern_plot_qc_3, filename="result/qc/plot/sc_rb.png", width=10, height=5)
ggsave(scattern_plot_qc_4, filename="result/qc/plot/sc_fl.png", width=10, height=5)
ggsave(scattern_plot_qc_5, filename="result/qc/plot/sc_qc.png", width=10, height=5)
In [23]:
options(repr.plot.width=15, repr.plot.height=15)
scattern_plot_qc_1 + scattern_plot_qc_2 + scattern_plot_qc_3 + scattern_plot_qc_4 + scattern_plot_qc_5 + plot_layout(nrow=3) & theme(legend.position="bottom")

Box plots¶

In [24]:
box_plot_qc_1 <- box_plot_qc(so=so_qc, y=nCount_RNA, fill=tissue, ylab="UMI [count]", ymin=0)
box_plot_qc_2 <- box_plot_qc(so=so_qc, y=nFeature_RNA, fill=tissue, ylab="Feature [count]", ymin=0)
box_plot_qc_3 <- box_plot_qc(so=so_qc, y=pMt_RNA, fill=tissue, ylab="Mt [%]", ymin=0, ymax=100)
box_plot_qc_4 <- box_plot_qc(so=so_qc, y=pHb_RNA, fill=tissue, ylab="Hb [%]", ymin=0, ymax=100)
box_plot_qc_5 <- box_plot_qc(so=so_qc, y=pRb_RNA, fill=tissue, ylab="Rbl [%]", ymin=0, ymax=100)
In [25]:
ggsave(box_plot_qc_1[[1]] + box_plot_qc_1[[2]] + plot_layout(guides="collect", ncol=1) & theme(legend.position="bottom"), filename="result/qc/plot/box_plot_qc_1.png", width=7.5, height=5)
ggsave(box_plot_qc_2[[1]] + box_plot_qc_2[[2]] + plot_layout(guides="collect", ncol=1) & theme(legend.position="bottom"), filename="result/qc/plot/box_plot_qc_2.png", width=7.5, height=5)
ggsave(box_plot_qc_3[[1]] + box_plot_qc_3[[2]] + plot_layout(guides="collect", ncol=1) & theme(legend.position="bottom"), filename="result/qc/plot/box_plot_qc_3.png", width=7.5, height=5)
ggsave(box_plot_qc_4[[1]] + box_plot_qc_4[[2]] + plot_layout(guides="collect", ncol=1) & theme(legend.position="bottom"), filename="result/qc/plot/box_plot_qc_4.png", width=7.5, height=5)
ggsave(box_plot_qc_5[[1]] + box_plot_qc_5[[2]] + plot_layout(guides="collect", ncol=1) & theme(legend.position="bottom"), filename="result/qc/plot/box_plot_qc_5.png", width=7.5, height=5)
In [26]:
options(repr.plot.width=15, repr.plot.height=20)
box_plot_qc_1[[1]] + box_plot_qc_1[[2]] + box_plot_qc_2[[1]]  + box_plot_qc_2[[2]] + 
box_plot_qc_3[[1]] + box_plot_qc_3[[2]] + box_plot_qc_4[[1]]  + box_plot_qc_4[[2]] + 
box_plot_qc_5[[1]] + box_plot_qc_5[[2]] + plot_spacer() + plot_spacer() + plot_layout(guides="collect", ncol=2) & theme(legend.position="none")

Filter by QC class¶

In [27]:
so_qc <- subset(so_qc, subset = qc_class=="pass")

UMAP and clustering of all samples on all genes¶

In [28]:
so_qc <- NormalizeData(so_qc, assay="RNA")
so_qc <- ScaleData(so_qc, assay="RNA", verbose=FALSE)
so_qc <- FindVariableFeatures(so_qc, nfeatures=3000, assay="RNA", verbose=FALSE)
so_qc <- RunPCA(so_qc, npcs=50, assay="RNA", features=NULL, verbose=FALSE)
so_qc <- FindNeighbors(so_qc, dims=1:10, k.param=20, verbose=FALSE)
so_qc <- FindClusters(so_qc, verbose=FALSE, resolution=1, algorithm=1, group.singletons=TRUE)
so_qc <- RunUMAP(so_qc, dims=1:50, n.neighbors=30, verbose=FALSE)
In [29]:
options(repr.plot.width=22.5, repr.plot.height=15)

dplot_1 <- dplot(so_qc, reduction="umap", group_by="seurat_clusters", label=TRUE) + ggtitle("Louvain cluster")
dplot_2 <- dplot(so_qc, reduction="umap", group_by="tissue") + scale_color_manual(values=color$tissue) + ggtitle("Tissue")
dplot_3 <- dplot(so_qc, reduction="umap", group_by="treatment") + scale_color_manual(values=color$treatment) + ggtitle("Treatment")
dplot_4 <- dplot(so_qc, reduction="umap", group_by="cc_phase_class") + scale_color_manual(values=color$cc_phase_class) + ggtitle("Cell cycle")
dplot_5 <- dplot(so_qc, reduction="umap", group_by="label_main_haemosphere") + scale_color_manual(values=color$label_main_haemosphere) + ggtitle("Haemosphere label (main)")
dplot_6 <- dplot(so_qc, reduction="umap", group_by="label_fine_haemosphere") + scale_color_manual(values=color$label_fine_haemosphere) + ggtitle("Haemosphere label (fine)")
dplot_7 <- dplot(so_qc, reduction="umap", group_by="sample_rep") + ggtitle("Replicate")
fplot_1 <- fplot(so_qc, reduction="umap", features="Trac") + scale_colour_viridis_c(option="inferno") + ggtitle("Trac")
fplot_2 <- fplot(so_qc, reduction="umap", features="Igkc") + scale_colour_viridis_c(option="inferno") + ggtitle("Igkc")

dplot_comb <- dplot_1 + dplot_2 + dplot_3 + dplot_4 + dplot_5 + dplot_6 + dplot_7 + fplot_1 + fplot_2 + plot_layout(ncol=3)
dplot_comb
Scale for 'colour' is already present. Adding another scale for 'colour',
which will replace the existing scale.

Scale for 'colour' is already present. Adding another scale for 'colour',
which will replace the existing scale.

In [30]:
so_qc@meta.data %>% dplyr::group_by(seurat_clusters, treatment, sample_rep) %>% 
    dplyr::summarise(n=n()) %>% data.frame() %>% 
    tidyr::spread(seurat_clusters, n) %>% 
    kableExtra::kable("html") %>% as.character() %>% IRdisplay::display_html()
`summarise()` has grouped output by 'seurat_clusters', 'treatment'. You can
override using the `.groups` argument.
treatment sample_rep 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22
NaCl Rep1 482 22 16 102 120 288 156 112 503 56 300 113 11 128 160 79 67 92 139 42 112 22 8
NaCl Rep2 280 82 119 40 113 554 737 565 43 82 170 262 101 135 470 120 229 65 65 62 27 67 29
CpG Rep1 531 529 239 794 426 131 79 100 457 358 189 164 154 209 38 195 108 78 129 106 90 13 20
CpG Rep2 154 744 950 310 549 142 127 256 22 483 317 380 569 316 74 316 271 194 64 80 15 1 42

Doublet detection¶

I use the strategy from Sala et al., 2019 which was also used in Barile et al., 2021 with the source documented here. We only do the refined method with combined samples. The functions used in the script are documented here. We do doublet detection on sample groups (Myeloid and Progenitor cells of a replicated) and integrated those groups per batch and then by treatment with FastMNN

Doublet score¶

In [31]:
# Split data by sample group 
so_qc <- SplitObject(so_qc, split.by="sample_group")

# Compute doublet density per sample group 
so_qc <- lapply(so_qc, function(so) {
    
    DefaultAssay(so) <- "RNA"
    
    # Remove empty genes from split 
    cnt <- GetAssayData(so, assay="RNA", slot="counts")
    cnt <- cnt[rowSums(cnt) > 0, ]
    
    # Highly variable features 
    so <- ScaleData(so, features=rownames(cnt), assay="RNA", verbose=FALSE)
    so <- FindVariableFeatures(so, assay="RNA", nfeatures=2000)
    
    # Compute doublet density 
    so$doublet_score <- scDblFinder::computeDoubletDensity(cnt, subset.row=VariableFeatures(so), size.factors.norm=so$nCount_RNA, dims=50)
    so$doublet_score_log2 <- log2(so$doublet_score)
    
    return(so)
    
}
               )

Sample group doublets¶

In [32]:
# Compute sub-cluster per sample group 
so_qc <- lapply(so_qc, function(so) {
    
    # Set default assay 
    DefaultAssay(so) <- "RNA"
    
    # Highly variable genes across all data sets 
    cnt <- GetAssayData(so, assay="RNA", slot="data")
    cnt <- cnt[rowSums(cnt)>0, ]
    
    # Cluster function     
    so <- ScaleData(so, features=rownames(cnt), assay="RNA", verbose=FALSE)
    so <- RunPCA(so, npcs=50, features=rownames(cnt), assay="RNA", verbose=FALSE)
    so <- FindNeighbors(so, dims=1:50, k.param=15, reduction="pca", verbose=FALSE)
    so <- FindClusters(so, verbose=FALSE, resolution=1, algorithm=1)

    # Compute sub-cluster 
    cluster_list <- list()
    for(cluster_id in unique(so$seurat_clusters)) {
        
        sink("NUL")
        so <- FindSubCluster(so, cluster_id, graph.name="RNA_snn", subcluster.name="doublet_cluster", resolution=1, algorithm=1)
        sink()

        cluster <- so@meta.data[so$seurat_clusters==cluster_id, ] 
        cluster <- cluster[, c("cell_id", "doublet_cluster")]

        cluster_list[[cluster_id]] <- cluster

    }

    cluster <- do.call(rbind, cluster_list)
    rownames(cluster) <- cluster$cell_id
    cluster <- cluster[, "doublet_cluster", drop=FALSE]

    so <- AddMetaData(so, cluster)
    
    return(so)
    
}
               )
In [33]:
doublet_stat <- lapply(so_qc, function(so) {
    
    meta <- so@meta.data
    
    # Function to computes MAD for values of x greater the median
    mad_upper <- function(x){

        x <- x-median(x)
        x_mad <- mad(x[x>0], center=0)
        return(x_mad)

    }

    # Comput median doublet density per cluster
    doublet_stat <- dplyr::group_by(meta, doublet_cluster, sample_group) %>% 
        dplyr::summarise(doublet_cluster_density=median(doublet_score))

    # Compute median-centered, MAD-estimated variance normal distribution of cluster medians. 
    doublet_stat$p=pnorm(
        
        doublet_stat$doublet_cluster_density, 
        mean=median(doublet_stat$doublet_cluster_density), 
        sd=mad_upper(doublet_stat$doublet_cluster_density), 
        lower.tail=FALSE
        
    )
    
    return(doublet_stat)
    
}
               )

doublet_stat <- do.call(rbind, doublet_stat)
doublet_stat$fdr=p.adjust(doublet_stat$p, method="fdr")
    
# Call doublet cluster in meta data 
so_qc <- lapply(names(so_qc), function(sample_group){
    
    so <- so_qc[[sample_group]]
    
    so$doublet_class <- ifelse(so$doublet_cluster %in% doublet_stat[doublet_stat$fdr < 0.1 & doublet_stat$sample_group==sample_group, ]$doublet_cluster, "Doublet", "Singlet")
    
    return(so)
    
}
              )

names(so_qc) <- lapply(so_qc, function(so) {so$sample_group[1]})
`summarise()` has grouped output by 'doublet_cluster'. You can override using
the `.groups` argument.
`summarise()` has grouped output by 'doublet_cluster'. You can override using
the `.groups` argument.
`summarise()` has grouped output by 'doublet_cluster'. You can override using
the `.groups` argument.
`summarise()` has grouped output by 'doublet_cluster'. You can override using
the `.groups` argument.

Treatment group doublets¶

In [34]:
# FastMNN on batch  
so_qc_1 <- RunFastMNN(so_qc[c("NaCl_Rep2", "NaCl_Rep1")], assay="RNA", reconstructed.assay="MNN")
so_qc_2 <- RunFastMNN(so_qc[c("CpG_Rep2", "CpG_Rep1")], assay="RNA", reconstructed.assay="MNN")

# FastMNN on treatment groups 
so_qc <- RunFastMNN(list(so_qc_1, so_qc_2), assay="MNN", reconstructed.assay="MNN")
    
DefaultAssay(so_qc) <- "MNN"

so_qc <- FindNeighbors(so_qc, dims=1:50, k.param=15, reduction="mnn", graph.name=c("MNN_nn", "MNN_snn"), verbose=FALSE)
so_qc <- FindClusters(so_qc, verbose=FALSE, resolution=1, algorithm=1, graph.name="MNN_snn") 
so_qc <- RunUMAP(so_qc, dims=1:50, n.neighbors=30, reduction="mnn", verbose=FALSE)

so_qc$doublet_cluster_int <- so_qc$seurat_clusters

DefaultAssay(so_qc) <- "RNA"
Computing 2000 integration features

Computing 2000 integration features

Computing 2000 integration features

In [35]:
# Function to computes MAD for values of x greater the median
mad_upper <- function(x){
    
    x <- x-median(x)
    x_mad <- mad(x[x>0], center=0)
    return(x_mad)

}
    
doublet_stat <- dplyr::group_by(so_qc@meta.data, doublet_cluster_int) %>% 
    dplyr::mutate(cluster_count=n()) %>% dplyr::ungroup() %>% 
    dplyr::group_by(doublet_cluster_int, doublet_class, cluster_count) %>% 
    dplyr::summarise(doublet_ratio=n()) %>% dplyr::ungroup() %>% 
    dplyr::mutate(doublet_ratio=doublet_ratio/cluster_count) %>% 
    dplyr::filter(doublet_class=="Doublet")

doublet_stat$p=pnorm(doublet_stat$doublet_ratio, mean=median(doublet_stat$doublet_ratio), sd=mad_upper(doublet_stat$doublet_ratio), lower.tail=FALSE)
doublet_stat$fdr=p.adjust(doublet_stat$p, method="fdr")
    
so_qc$doublet_class_int <- ifelse(so_qc$doublet_cluster_int %in% doublet_stat[doublet_stat$fdr < 0.1, ]$doublet_cluster_int, "Doublet", "Singlet")
`summarise()` has grouped output by 'doublet_cluster_int', 'doublet_class'. You
can override using the `.groups` argument.
In [36]:
options(repr.plot.width=22.5, repr.plot.height=20)

dplot_1 <- dplot(so_qc, reduction="umap", group_by="treatment") + scale_color_manual(values=color$treatment) + ggtitle("Treatment")
dplot_2 <- dplot(so_qc, reduction="umap", group_by="tissue") + scale_color_manual(values=color$tissue) + ggtitle("Tissue")
dplot_3 <- dplot(so_qc, reduction="umap", group_by="cc_phase_class") + scale_color_manual(values=color$cc_phase_class) + ggtitle("Cell cycle")
dplot_4 <- dplot(so_qc, reduction="umap", group_by="label_fine_haemosphere") + scale_color_manual(values=color$label_fine_haemosphere) + ggtitle("Haemosphere label")
fplot_1 <- fplot(so_qc, reduction="umap", features="doublet_score_log2") + scale_colour_viridis_c(option="inferno") + ggtitle("Log2(Doublet score)")
dplot_5 <- dplot(so_qc, reduction="umap", group_by="doublet_class") + scale_color_manual(values=c("Doublet"="black", "Singlet"="light grey")) + ggtitle("Doublet class sample")
dplot_6 <- dplot(so_qc, reduction="umap", group_by="doublet_class_int") + scale_color_manual(values=c("Doublet"="black", "Singlet"="light grey")) + ggtitle("Doublet class integrated")
fplot_2 <- fplot(so_qc, reduction="umap", features="pMt_RNA") + scale_colour_viridis_c(option="inferno") + ggtitle("Percentage Mt")
fplot_3 <- fplot(so_qc, reduction="umap", features="pHb_RNA") + scale_colour_viridis_c(option="inferno") + ggtitle("Percentage Hb")
fplot_4 <- fplot(so_qc, reduction="umap", features="pRb_RNA") + scale_colour_viridis_c(option="inferno") + ggtitle("Percentage Rb")
fplot_5 <- fplot(so_qc, reduction="umap", features="nCount_RNA") + scale_colour_viridis_c(option="inferno") + ggtitle("UMI count")
fplot_6 <- fplot(so_qc, reduction="umap", features="nFeature_RNA") + scale_colour_viridis_c(option="inferno") + ggtitle("Feature count")

dplot_comb <- dplot_1 + dplot_2 + dplot_3 + dplot_4 + fplot_1 + dplot_5 + dplot_6 + fplot_2 + fplot_3 + fplot_4 + fplot_5 + fplot_6 + plot_layout(ncol=3)
dplot_comb
Scale for 'colour' is already present. Adding another scale for 'colour',
which will replace the existing scale.

Scale for 'colour' is already present. Adding another scale for 'colour',
which will replace the existing scale.

Scale for 'colour' is already present. Adding another scale for 'colour',
which will replace the existing scale.

Scale for 'colour' is already present. Adding another scale for 'colour',
which will replace the existing scale.

Scale for 'colour' is already present. Adding another scale for 'colour',
which will replace the existing scale.

Scale for 'colour' is already present. Adding another scale for 'colour',
which will replace the existing scale.

In [37]:
options(repr.plot.width=22.5, repr.plot.height=5)

dplot_1 <- dplot(so_qc, reduction="umap", group_by="seurat_clusters", label=TRUE) + ggtitle("Louvain cluster")
fplot_1 <- fplot(so_qc, reduction="umap", features="Trac") + scale_colour_viridis_c(option="inferno") + ggtitle("Trac")
fplot_2 <- fplot(so_qc, reduction="umap", features="Igkc") + scale_colour_viridis_c(option="inferno") + ggtitle("Igkc")

dplot_comb <- dplot_1 + fplot_1 + fplot_2 + plot_layout(ncol=3)
dplot_comb
Scale for 'colour' is already present. Adding another scale for 'colour',
which will replace the existing scale.

Scale for 'colour' is already present. Adding another scale for 'colour',
which will replace the existing scale.

Filter out doublets, T- and B- lymphocytes¶

In [38]:
so_qc <- subset(so_qc, subset=doublet_class=="Singlet")
so_qc <- subset(so_qc, subset=doublet_class_int=="Singlet")
so_qc <- subset(so_qc, subset=label_fine_haemosphere!="CLP") # Filter out T lymphocytes
so_qc <- subset(so_qc, subset=label_fine_haemosphere!="T-cell") # Filter out T lymphocytes
so_qc <- subset(so_qc, subset=label_fine_haemosphere!="NK") # Filter out T lymphocytes
so_qc <- subset(so_qc, subset=label_main_haemosphere!="NK") # Filter out T lymphocytes
so_qc <- subset(so_qc, subset=seurat_clusters!=18) # Filter out T lymphocytes
so_qc <- subset(so_qc, subset=seurat_clusters!=19) # Filter out B lymphocytes
In [39]:
options(repr.plot.width=22.5, repr.plot.height=20)

dplot_1 <- dplot(so_qc, reduction="umap", group_by="treatment") + scale_color_manual(values=color$treatment) + ggtitle("Treatment")
dplot_2 <- dplot(so_qc, reduction="umap", group_by="tissue") + scale_color_manual(values=color$tissue) + ggtitle("Tissue")
dplot_3 <- dplot(so_qc, reduction="umap", group_by="cc_phase_class") + scale_color_manual(values=color$cc_phase_class) + ggtitle("Cell cycle")
dplot_4 <- dplot(so_qc, reduction="umap", group_by="label_fine_haemosphere") + scale_color_manual(values=color$label_fine_haemosphere) + ggtitle("Haemosphere label")
fplot_1 <- fplot(so_qc, reduction="umap", features="doublet_score_log2") + scale_colour_viridis_c(option="inferno") + ggtitle("Log2(Doublet score)")
dplot_5 <- dplot(so_qc, reduction="umap", group_by="doublet_class") + scale_color_manual(values=c("Doublet"="black", "Singlet"="light grey")) + ggtitle("Doublet class sample")
dplot_6 <- dplot(so_qc, reduction="umap", group_by="doublet_class_int") + scale_color_manual(values=c("Doublet"="black", "Singlet"="light grey")) + ggtitle("Doublet class integrated")
fplot_2 <- fplot(so_qc, reduction="umap", features="pMt_RNA") + scale_colour_viridis_c(option="inferno") + ggtitle("Percentage Mt")
fplot_3 <- fplot(so_qc, reduction="umap", features="pHb_RNA") + scale_colour_viridis_c(option="inferno") + ggtitle("Percentage Hb")
fplot_4 <- fplot(so_qc, reduction="umap", features="pRb_RNA") + scale_colour_viridis_c(option="inferno") + ggtitle("Percentage Rb")
fplot_5 <- fplot(so_qc, reduction="umap", features="nCount_RNA") + scale_colour_viridis_c(option="inferno") + ggtitle("UMI count")
fplot_6 <- fplot(so_qc, reduction="umap", features="nFeature_RNA") + scale_colour_viridis_c(option="inferno") + ggtitle("Feature count")

dplot_comb <- dplot_1 + dplot_2 + dplot_3 + dplot_4 + fplot_1 + dplot_5 + dplot_6 + fplot_2 + fplot_3 + fplot_4 + fplot_5 + fplot_6 + plot_layout(ncol=3)
dplot_comb
Scale for 'colour' is already present. Adding another scale for 'colour',
which will replace the existing scale.

Scale for 'colour' is already present. Adding another scale for 'colour',
which will replace the existing scale.

Scale for 'colour' is already present. Adding another scale for 'colour',
which will replace the existing scale.

Scale for 'colour' is already present. Adding another scale for 'colour',
which will replace the existing scale.

Scale for 'colour' is already present. Adding another scale for 'colour',
which will replace the existing scale.

Scale for 'colour' is already present. Adding another scale for 'colour',
which will replace the existing scale.

In [40]:
options(repr.plot.width=22.5, repr.plot.height=5)

dplot_1 <- dplot(so_qc, reduction="umap", group_by="seurat_clusters", label=TRUE) + ggtitle("Louvain cluster")
fplot_1 <- fplot(so_qc, reduction="umap", features="Trac") + scale_colour_viridis_c(option="inferno") + ggtitle("Trac")
fplot_2 <- fplot(so_qc, reduction="umap", features="Igkc") + scale_colour_viridis_c(option="inferno") + ggtitle("Igkc")

dplot_comb <- dplot_1 + fplot_1 + fplot_2 + plot_layout(ncol=3)
dplot_comb
Scale for 'colour' is already present. Adding another scale for 'colour',
which will replace the existing scale.

Scale for 'colour' is already present. Adding another scale for 'colour',
which will replace the existing scale.

In [41]:
unique(so_qc$label_fine_haemosphere)
  1. 'cDC'
  2. 'B-cell'
  3. 'Mo'
  4. 'CMP'
  5. 'pDC'
  6. 'Baso'
  7. 'Mac'
  8. 'GMP'
  9. 'EoP'
  10. 'Eo'
  11. 'preCFUE'
  12. 'pbEry'
  13. 'CFUE'
  14. 'MEP'
  15. 'STHSC'
  16. 'poEry'
  17. 'Meg'
  18. 'Retic'
  19. 'MPP'
  20. 'LSK'
  21. 'Mast'
  22. 'Neu'
In [42]:
unique(so_qc$label_main_haemosphere)
  1. 'Mo'
  2. 'DC'
  3. 'B-cell'
  4. 'RPP'
  5. 'Baso'
  6. 'Mac'
  7. 'Eo'
  8. 'MPP'
  9. 'Ery'
  10. 'Meg'
  11. 'Mast'

Save result objso_qc_rds¶

In [43]:
# Store data as RDS 
saveRDS(so_qc, so_qc_rds)
In [44]:
# Store data as h5ad 
adata <- import("anndata", as="adata", convert=FALSE)
pd <- import("pandas", as="pd", convert=FALSE)
np <- import("numpy", as="np", convert=FALSE)

    
# Transform dgCMatrix to 
X <- GetAssayData(so_qc, assay="RNA", slot="counts") %>% as.matrix() %>% t()
X <- np$array(X, dtype=np$int32)
    
adata <- adata$AnnData(X=X, obs=so_qc@meta.data)
adata$var_names <- rownames(GetAssayData(so_qc, assay="RNA", slot="counts"))

adata$raw <- adata
adata$write_h5ad(so_qc_h5ad)
None

Session info¶

In [45]:
sessionInfo()
R version 4.1.0 (2021-05-18)
Platform: x86_64-conda-linux-gnu (64-bit)
Running under: Red Hat Enterprise Linux 8.2 (Ootpa)

Matrix products: default
BLAS/LAPACK: /home/fdeckert/bin/miniconda3/envs/r.4.1.0-FD20200109SPLENO/lib/libopenblasp-r0.3.15.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    grid      stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
 [1] viridis_0.6.2               viridisLite_0.4.0          
 [3] SingleR_1.8.1               SingleCellExperiment_1.16.0
 [5] SummarizedExperiment_1.24.0 Biobase_2.54.0             
 [7] GenomicRanges_1.46.1        GenomeInfoDb_1.30.1        
 [9] IRanges_2.28.0              S4Vectors_0.32.4           
[11] BiocGenerics_0.40.0         MatrixGenerics_1.6.0       
[13] matrixStats_0.62.0          ComplexHeatmap_2.10.0      
[15] patchwork_1.1.1             RColorBrewer_1.1-3         
[17] IRdisplay_1.1               kableExtra_1.3.4           
[19] knitr_1.39                  biomaRt_2.50.3             
[21] reticulate_1.24             forcats_0.5.1              
[23] stringr_1.4.0               dplyr_1.0.9                
[25] purrr_0.3.4                 readr_2.1.2                
[27] tidyr_1.2.0                 tibble_3.1.7               
[29] ggplot2_3.3.6               tidyverse_1.3.1            
[31] SeuratWrappers_0.3.0        scDblFinder_1.8.0          
[33] sp_1.4-7                    SeuratObject_4.1.0         
[35] Seurat_4.1.1               

loaded via a namespace (and not attached):
  [1] rappdirs_0.3.3            pbdZMQ_0.3-7             
  [3] scattermore_0.8           R.methodsS3_1.8.1        
  [5] bit64_4.0.5               irlba_2.3.5              
  [7] DelayedArray_0.20.0       R.utils_2.11.0           
  [9] data.table_1.14.2         rpart_4.1.16             
 [11] doParallel_1.0.17         KEGGREST_1.34.0          
 [13] RCurl_1.98-1.6            generics_0.1.2           
 [15] ScaledMatrix_1.2.0        cowplot_1.1.1            
 [17] RSQLite_2.2.13            RANN_2.6.1               
 [19] future_1.25.0             bit_4.0.4                
 [21] tzdb_0.3.0                spatstat.data_2.2-0      
 [23] webshot_0.5.3             xml2_1.3.3               
 [25] lubridate_1.8.0           httpuv_1.6.5             
 [27] assertthat_0.2.1          xfun_0.30                
 [29] hms_1.1.1                 evaluate_0.15            
 [31] promises_1.2.0.1          fansi_1.0.3              
 [33] progress_1.2.2            dbplyr_2.1.1             
 [35] readxl_1.4.0              igraph_1.3.1             
 [37] DBI_1.1.2                 htmlwidgets_1.5.4        
 [39] spatstat.geom_2.4-0       ellipsis_0.3.2           
 [41] RSpectra_0.16-1           backports_1.4.1          
 [43] deldir_1.0-6              sparseMatrixStats_1.6.0  
 [45] vctrs_0.4.1               here_1.0.1               
 [47] Cairo_1.5-15              remotes_2.4.2            
 [49] ROCR_1.0-11               abind_1.4-5              
 [51] batchelor_1.10.0          cachem_1.0.6             
 [53] withr_2.5.0               progressr_0.10.0         
 [55] sctransform_0.3.3         prettyunits_1.1.1        
 [57] scran_1.22.1              goftest_1.2-3            
 [59] svglite_2.1.0             cluster_2.1.3            
 [61] lazyeval_0.2.2            crayon_1.5.1             
 [63] labeling_0.4.2            edgeR_3.36.0             
 [65] pkgconfig_2.0.3           nlme_3.1-157             
 [67] vipor_0.4.5               rlang_1.0.2              
 [69] globals_0.14.0            lifecycle_1.0.1          
 [71] miniUI_0.1.1.1            filelock_1.0.2           
 [73] BiocFileCache_2.2.1       modelr_0.1.8             
 [75] rsvd_1.0.5                rprojroot_2.0.3          
 [77] cellranger_1.1.0          polyclip_1.10-0          
 [79] lmtest_0.9-40             Matrix_1.4-1             
 [81] IRkernel_1.3              zoo_1.8-10               
 [83] reprex_2.0.1              base64enc_0.1-3          
 [85] beeswarm_0.4.0            GlobalOptions_0.1.2      
 [87] ggridges_0.5.3            rjson_0.2.21             
 [89] png_0.1-7                 bitops_1.0-7             
 [91] R.oo_1.24.0               KernSmooth_2.23-20       
 [93] Biostrings_2.62.0         blob_1.2.3               
 [95] DelayedMatrixStats_1.16.0 shape_1.4.6              
 [97] parallelly_1.31.1         spatstat.random_2.2-0    
 [99] beachmat_2.10.0           scales_1.2.0             
[101] memoise_2.0.1             magrittr_2.0.3           
[103] plyr_1.8.7                ica_1.0-2                
[105] zlibbioc_1.40.0           compiler_4.1.0           
[107] dqrng_0.3.0               clue_0.3-60              
[109] fitdistrplus_1.1-8        cli_3.3.0                
[111] XVector_0.34.0            listenv_0.8.0            
[113] pbapply_1.5-0             MASS_7.3-57              
[115] mgcv_1.8-40               tidyselect_1.1.2         
[117] stringi_1.7.6             highr_0.9                
[119] BiocSingular_1.10.0       locfit_1.5-9.5           
[121] ggrepel_0.9.1             tools_4.1.0              
[123] future.apply_1.9.0        parallel_4.1.0           
[125] circlize_0.4.14           rstudioapi_0.13          
[127] uuid_1.1-0                foreach_1.5.2            
[129] bluster_1.4.0             metapod_1.2.0            
[131] gridExtra_2.3             farver_2.1.0             
[133] Rtsne_0.16                digest_0.6.29            
[135] BiocManager_1.30.17       rgeos_0.5-9              
[137] shiny_1.7.1               Rcpp_1.0.8.3             
[139] broom_0.8.0               scuttle_1.4.0            
[141] later_1.3.0               RcppAnnoy_0.0.19         
[143] httr_1.4.3                AnnotationDbi_1.56.2     
[145] colorspace_2.0-3          rvest_1.0.2              
[147] XML_3.99-0.9              fs_1.5.2                 
[149] tensor_1.5                splines_4.1.0            
[151] uwot_0.1.11               statmod_1.4.36           
[153] spatstat.utils_2.3-0      scater_1.22.0            
[155] xgboost_1.6.0.1           plotly_4.10.0            
[157] systemfonts_1.0.4         xtable_1.8-4             
[159] jsonlite_1.8.0            R6_2.5.1                 
[161] pillar_1.7.0              htmltools_0.5.2          
[163] mime_0.12                 glue_1.6.2               
[165] fastmap_1.1.0             BiocParallel_1.28.3      
[167] BiocNeighbors_1.12.0      codetools_0.2-18         
[169] utf8_1.2.2                ResidualMatrix_1.4.0     
[171] lattice_0.20-45           spatstat.sparse_2.1-1    
[173] curl_4.3.2                ggbeeswarm_0.6.0         
[175] leiden_0.3.10             survival_3.3-1           
[177] limma_3.50.3              rmarkdown_2.14           
[179] repr_1.1.4                munsell_0.5.0            
[181] GetoptLong_1.0.5          GenomeInfoDbData_1.2.7   
[183] iterators_1.0.14          haven_2.5.0              
[185] reshape2_1.4.4            gtable_0.3.0             
[187] spatstat.core_2.4-2